TL;DR: I show how you can visualize the knowledge distribution of your source code by mining version control systems.
In software development, it's all about knowledge – both technical knowledge and knowledge of the business domain. But we software developers transfer only a small part of this knowledge into code, and code alone isn't enough to get a glimpse of the greater picture and the interrelations of all the different concepts. There will always be developers who know more about a concept than what is laid down in the source code. It's important to make sure that this knowledge is distributed over more than one head: more developers mean more perspectives on the problem domain, leading to a more robust and understandable code base.
How can we get insights about knowledge in code?
It's possible to estimate the knowledge distribution by analyzing the version control system. We can use active changes in the code as a proxy for "someone knew what they were doing", because otherwise they wouldn't have been able to contribute code at all. To find spots where the knowledge about the code could be improved, we identify areas that are possibly known by only one developer. This gives you a hint where you should start some pair programming or invest in redocumentation.
In this blog post, we approximate the knowledge distribution by counting the number of additions per file that each developer contributed to a software system. I'll show you step by step how you can do this by using Python and Pandas.
Attribution: This work is heavily inspired by Adam Tornhill's book "Your Code as a Crime Scene", in which he presents a similar analysis called a "knowledge map". I also use a similar visualization style, a "bubble chart", based on his work.
For this analysis, you need a log from your Git repository. In this example, we analyze a fork of the Spring PetClinic project.
To avoid some noise, we also add the parameters --no-merges and --no-renames.
git log --no-merges --no-renames --numstat --pretty=format:"%x09%x09%x09%aN"
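Thanks to the pretty format (three tabs followed by the author's name, %aN) combined with --numstat, the log contains an author line for each commit followed by one tab-separated line per changed file with the added lines, the deleted lines, and the file path. Schematically, the output looks roughly like this (blank separator lines omitted; author, file names, and numbers are purely illustrative):
			Max Mustermann
12	3	path/to/FileA.java
5	0	path/to/FileB.java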
We read the log output into a Pandas DataFrame by using the method described in this blog post, but slightly modified (because we need less data):
In [1]:
import git
from io import StringIO
import pandas as pd
# connect to repo
git_bin = git.Repo("../../buschmais-spring-petclinic/").git
# execute log command
git_log = git_bin.execute('git log --no-merges --no-renames --numstat --pretty=format:"%x09%x09%x09%aN"')
# read in the log
git_log = pd.read_csv(StringIO(git_log), sep="\x09", header=None, names=['additions', 'deletions', 'path','author'])
# forward-fill the author name into the file statistics rows and drop the author-only rows
commit_data = git_log[['additions', 'deletions', 'path']].join(git_log[['author']].fillna(method='ffill')).dropna()
commit_data.head()
Out[1]:
In this example, we are only interested in Java source code files that still exist in the software project.
We can retrieve the existing Java source code files by using Git's ls-files combined with a filter for the Java source code file extension. The command will return a plain text string that we split by the line endings to get a list of files. Because we want to combine this information with the other above, we put it into a DataFrame with the column name path.
In [2]:
existing_files = pd.DataFrame(git_bin.execute('git ls-files -- *.java').split("\n"), columns=['path'])
existing_files.head()
Out[2]:
The next step is to combine the commit_data with the existing_files information by using Pandas' merge function. By default, merge performs an inner join on the columns both DataFrames have in common – here, the path column.
In plain English, merge will only leave the still existing Java source code files in the DataFrame. This is exactly what we need.
In [3]:
contributions = pd.merge(commit_data, existing_files)
contributions.head()
Out[3]:
We can now convert some columns to their correct data types. The additions and deletions columns represent the added or deleted lines of code as numbers, so we convert them accordingly.
In [4]:
contributions['additions'] = pd.to_numeric(contributions['additions'])
contributions['deletions'] = pd.to_numeric(contributions['deletions'])
contributions.head()
Out[4]:
We want to estimate the knowledge about code as the proportion of additions to the whole source code file. This means we need to calculate the relative amount of added lines for each developer. To be able to do this, we have to know the sum of all additions for a file.
Additionally, we calculate it for deletions as well to easily get the number of lines of code later on.
We use an additional DataFrame to do these calculations.
In [5]:
contributions_sum = contributions.groupby('path').sum()[['additions', 'deletions']].reset_index()
contributions_sum.head()
Out[5]:
We also want to have an indicator of the quantity of the knowledge. This can be achieved by calculating the lines of code for each file, which is a simple subtraction of the deletions from the additions (be warned: this only works for simple cases without heavy file renames, which is the case here).
In [6]:
contributions_sum['lines'] = contributions_sum['additions'] - contributions_sum['deletions']
contributions_sum.head()
Out[6]:
We combine both DataFrames with a merge analogous to the one above.
In [7]:
contributions_all = pd.merge(
    contributions,
    contributions_sum,
    left_on='path',
    right_on='path',
    suffixes=['', '_sum'])
contributions_all.head()
Out[7]:
OK, here comes the key: We group all additions by file path and author. This gives us the additions to a file per author. Additionally, we want to keep the sum of all additions as well as the information about the lines of code. Because these are contained in the DataFrame multiple times, we just take the first entry for each.
In [8]:
grouped_contributions = contributions_all.groupby(
    ['path', 'author']).agg(
    {'additions' : 'sum',
     'additions_sum' : 'first',
     'lines' : 'first'})
grouped_contributions.head(10)
Out[8]:
Now we are ready to calculate the knowledge "ownership": an author's ownership of a file is their share of all additions to that file.
In [9]:
grouped_contributions['ownership'] = grouped_contributions['additions'] / grouped_contributions['additions_sum']
grouped_contributions.head()
Out[9]:
Having this data, we can now extract the author with the highest ownership value for each file. This gives us a list with the knowledge "holder" for each file.
In [10]:
# keep, for each file, the entry of the author with the highest ownership
ownerships = grouped_contributions.reset_index().sort_values('ownership').groupby('path').last()
ownerships.head(5)
Out[10]:
Reading tables is not as much fun as a good visualization. I find Adam Tornhill's suggestion of an enclosure or bubble chart very good:
Source: Thorsten Brunzendorf (@thbrunzendorf)
The visualization is written in D3 and just needs data in a specific format called "flare". So let's prepare some data for this!
First, we calculate the responsible author. We say that an author that contributed more than 70% of the source code is the responsible person that we have to ask if we want to know something about the code. For all the other code parts, we assume that the knowledge is distributed among different heads.
In [11]:
plot_data = ownerships.reset_index()
plot_data['responsible'] = plot_data['author']
plot_data.loc[plot_data['ownership'] <= 0.7, 'responsible'] = "None"
plot_data.head()
Out[11]:
Next, we need a color per author to be able to distinguish them in our visualization. We use the two classic data analysis libraries NumPy and matplotlib for this and simply draw a color for each author from a color map.
In [12]:
import numpy as np
from matplotlib import cm
from matplotlib.colors import rgb2hex
authors = plot_data[['author']].drop_duplicates()
rgb_colors = [rgb2hex(x) for x in cm.RdYlGn_r(np.linspace(0,1,len(authors)))]
authors['color'] = rgb_colors
authors.head()
Out[12]:
Then we join the colors with the plot data and whiten out the files without a clearly responsible developer (the "None" entries).
In [13]:
colored_plot_data = pd.merge(
    plot_data, authors,
    left_on='responsible',
    right_on='author',
    how='left',
    suffixes=['', '_color'])
colored_plot_data.loc[colored_plot_data['responsible'] == 'None', 'color'] = "white"
colored_plot_data.head()
Out[13]:
The bubble chart needs data in D3's "flare" format for displaying. We just dump the DataFrame data into this hierarchical format. For the hierarchy, we use the directory structure of the Java source files.
In [14]:
import os
import json

json_data = {}
json_data['name'] = 'flare'
json_data['children'] = []

for row in colored_plot_data.iterrows():
    series = row[1]
    path, filename = os.path.split(series['path'])

    last_children = None
    children = json_data['children']

    # walk along the directory parts, creating missing hierarchy nodes on the way
    for path_part in path.split("/"):
        entry = None
        for child in children:
            if "name" in child and child["name"] == path_part:
                entry = child
        if not entry:
            entry = {}
            children.append(entry)
        entry['name'] = path_part
        if 'children' not in entry:
            entry['children'] = []
        children = entry['children']
        last_children = children

    # add the file itself as a leaf with the responsible author, ownership, size and color
    last_children.append({
        'name' : filename + " [" + series['responsible'] + ", " + "{:6.2f}".format(series['ownership']) + "]",
        'size' : series['lines'],
        'color' : series['color']})

with open("vis/flare.json", mode='w', encoding='utf-8') as json_file:
    json_file.write(json.dumps(json_data, indent=3))
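For orientation, the generated vis/flare.json follows the typical flare structure: nested nodes with a name and children for the directories, plus size and color on the file leaves. It looks roughly like this (names and values here are purely illustrative):
{
   "name": "flare",
   "children": [
      {
         "name": "src",
         "children": [
            {
               "name": "SomeClass.java [Some Developer,   0.85]",
               "size": 120,
               "color": "#d7191c"
            }
         ]
      }
   ]
}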
You can see the complete, interactive visualization here. Just click on one of the bubbles and you will see how it works.
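If you want to explore the result locally instead, a simple option (assuming the D3 bubble chart page lives next to flare.json in the vis/ directory) is to serve that directory with Python's built-in HTTP server and open it in a browser:
cd vis
python -m http.server 8000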
The source code files are ordered hierarchically into bubbles. The size of the bubbles represents the lines of code and the different colors stand for each developer.
![](./resources/knowledge_island_1.png)
On the left side, you can see that there are some red bubbles. Drilling down, we see that one developer added almost all of the code for the tests:
On the right side, you see that some knowledge is evenly distributed (white bubbles), but there are also some knowledge islands. Especially the PetClinicInitializer.java class got my attention because it's big and only one developer knows what's going on here:
I also ran the analysis for the huge repository of IntelliJ IDEA Community Edition. It contains over 170000 commits for 55391 Java source code files. The visualization works even here (it's just a little bit slow and confusing), but the flare.json file is almost 30 MB, so it's not practical to view online. But here is the overview picture:
We can quickly get an impression of the knowledge distribution in a software system. With the bubble chart visualization, you get an overview as well as detailed information about the contributors of your source code.
But I want to mention two caveats of this method:
But as you have seen, the analysis can guide you nevertheless and give you great insights very quickly.